


The present invention relates generally to natural language processing and 
more particularly relates to a system and method for determining the similarity of text 
in short passages. 



With the growing volume of textual information, such as newspaper articles, 
magazines, Internet articles, and the like, there is a growing need to automatically 
cluster and/or classify such documents and determine whether groups of documents 
express similarities or not. For the most part, research in this area has focused on 
detecting similarity between documents and large segments of text or between a short 
query phrase and one or more documents. 

While effective techniques have been developed for document clustering and 
classification which depend on inter-document similarity measures, these techniques 
generally rely only on shared words, or occasionally on collocation of words. Such 
techniques are applicable when large units of text, such as full documents, are 
compared. In this case, there is generally sufficient overlap to detect similarity in the 
documents and/or document segments. However, when the units of text are small, for 
example a paragraph or abstract, such simple surface matching of words and phrases 
is far more prone to error. In the case of small text units, the sample size is reduced 
and the number of potential matches is reduced accordingly. Thus, there remains a 
need for improved techniques for detecting similarities between small text units. 

A further problem with known techniques for detecting similarity is that the 
conventional notions of similarity which are applicable to large text samples, such as 
documents and large text segments, do not provide sufficient measures of similarity 
for measuring similarity in small text segments. Standard notions of similarity 
generally involve the creation of a vector or profile of characteristics of a text
fragment and determine a conceptual distance between vectors on the basis of
frequencies. Features typically include stemmed words, although multi-word units
and collocations also have been used. Typological characteristics, such as thesaural
features, have also been used to calculate features. The difference between vectors for
one text unit (usually a query) and another text unit (usually a document) then
determines closeness or similarity of the text units.

In some cases, the text units are represented as vectors of sparse n-grams of
word occurrences and learning is applied over those vectors. Though effective in the
context of large document comparisons, a more fine-grained distinction for similarity
measures is required to properly characterize the similarity of two small text
segments.

SUMMARY OF THE INVENTION 
It is an object of the present invention to provide systems and methods for 
detecting similarity between two or more small text segments. 

A method for determining similarity in short text segments in accordance with
the present invention includes the steps of determining common primitive features in
the text segments, determining common composite features in the text segments and
then calculating a similarity measure based upon the primitive and composite
features. The primitive features can be selected from the group including common
single words, common noun phrases, synonyms, common semantic classes of verbs,
and common proper nouns. The composite features, which represent relationships
between and among the primitive features, can be selected from the group including
primitive feature order restrictions, primitive feature distance restrictions, and
primitive type restrictions.

Preferably, the step of determining common primitive features can include the
further steps of identifying common primitive features, assigning a value to the
primitive features, and normalizing the feature values. Normalizing the values can
include normalizing for text segment length and normalizing for the frequency of
primitive feature occurrence. Similarly, determining composite features generally
includes identifying the composite features, assigning a value to the composite
features, and normalizing the feature values. Again, normalization of the feature
values can include normalizing for text segment length and normalizing for the
frequency of feature occurrence.

BRIEF DESCRIPTION OF THE DRAWING 
Further objects, features and advantages of the invention will become apparent
from the following detailed description taken in conjunction with the accompanying
figures showing illustrative embodiments of the invention, in which

Figure 1 is a flow chart illustrating an overview of a present method for
comparing small text segments;

Figure 2 is a flow chart illustrating the step of defining similarity for small text
segments in accordance with the present methods;

Figure 3 is a flow chart illustrating the process of computing primitive features
for use in detecting similarity in small text segments;

Figure 4 is a flow chart illustrating the process of calculating composite
features for use in detecting similarity of small text segments in accordance with the
present methods;

Figure 5 is a block diagram of a software system topology for determining
similarity in small text segments in accordance with the present methods;

Figure 6 is an illustration of exemplary short text segments;

Figure 7 is a diagram illustrating a composite feature match between two of
the short text segments provided in Figure 6 using a "same order" rule;

Figure 8 is a diagram illustrating a composite feature match between two of
the short text segments provided in Figure 6 using a "within distance" rule; and

Figure 9 is a diagram illustrating a composite feature match between two of
the short text segments provided in Figure 6 using a "primitive type" rule.

Throughout the figures, the same reference numerals and characters, unless
otherwise stated, are used to denote like features, elements, components or portions of
the illustrated embodiments. Moreover, while the subject invention will now be
described in detail with reference to the figures, it is done so in connection with the
illustrative embodiments. It is intended that changes and modifications can be made
to the described embodiments without departing from the true scope and spirit of the
subject invention as defined by the appended claims.



DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS 
Figure 1 is a flow chart illustrating an overview of the process used in the
present invention for detecting similarity in small text segments. As previously noted,
a problem in the prior art is that the definition of similarity commonly used for large
text segments, such as documents, is not sufficiently refined to provide an adequate
measure of similarity when comparing small text segments. Generally, small text
segments refer to sentences, phrases and short paragraphs.

Referring to Figure 1, in step 100 a definition of similarity for small text
segments is provided. From this definition, the method proceeds to identify primitive
features of the small text segments and determine feature values for the primitive
features (step 105). Primitive features are those which generally compare simple parts
of speech and text, such as single words, word categories, or phrases such as noun
phrases, synonyms, verb class and proper nouns. In addition to primitive features, the
process can identify composite features of the short text segments and determine
composite feature values (step 110). Composite features are those which compare
relationships among two or more primitive features. Once primitive features and
composite features have been identified and given an appropriate value, a machine
learning algorithm is applied to classify small text segments as similar or not similar
(step 115).

Figure 2 is a flow chart which illustrates the process of establishing an
appropriate definition of similarity for small text segments. In general, two text units
can be considered as similar if they share the same focus on a common concept, actor,
object or action. In addition, the common actor or object must perform or
be subjected to the same action or be the subject of the same description. This is
exemplified in the flow chart of Figure 2, where two small text segments are selected
from a body of text and are analyzed. If the two text segments relate to a common
concept (step 205), then further analysis is performed to see if the common concept
relates to the same action (step 210) or relates to the same description (step 215).
Similar tests are performed to determine if the two text segments relate to a common
actor (step 220) or to a common object (step 225). If there is no common concept,
actor or object, the text segments are considered not similar (step 235). Similarly, for
those text segments which do refer or relate to a common concept, actor or object,
those segments will still be found not similar unless they also relate to a common
action or involve the same description. Thus, for short text segments to be similar,
they must contain a common concept, actor, or object which is also the subject of a
common action or description. The comparisons in steps 205, 220 and 225 can be the
basis for primitive features 240. Those relationships between primitive features which
are identified in steps 210 and 215 can be referred to as composite features 245.

While Figure 2 is illustrated as a sequential process, it represents a decision
tree involved in a definition of similarity of two short text segments as applied in the
present invention, which can also be performed in a largely parallel manner. For
example, decisions 205, 220 and 225 can be performed concurrently, as can decisions
210 and 215. Using this definition of similarity for small text segments, a feature-based
process can be employed which compares primitive and composite features of
short text segments to determine if the definition is satisfied for two or more given
input text segments.
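
By way of illustration only, the decision logic of Figure 2 might be sketched in Python as follows. The `SegmentAnalysis` record and the `is_similar` function are hypothetical names introduced for this sketch, and the sketch assumes that the concepts, actors, objects, actions and descriptions of each segment have already been extracted by some other means. It also simplifies the definition by not checking that the shared action or description attaches to the particular shared concept, actor or object.

```python
from dataclasses import dataclass, field


@dataclass
class SegmentAnalysis:
    """Hypothetical, pre-computed analysis of one short text segment."""
    concepts: set = field(default_factory=set)
    actors: set = field(default_factory=set)
    objects: set = field(default_factory=set)
    actions: set = field(default_factory=set)
    descriptions: set = field(default_factory=set)


def is_similar(a: SegmentAnalysis, b: SegmentAnalysis) -> bool:
    """Apply the two-level test of Figure 2 to a pair of analysed segments."""
    # Steps 205, 220 and 225: is there a common concept, actor or object?
    shared_focus = (
        (a.concepts & b.concepts)
        or (a.actors & b.actors)
        or (a.objects & b.objects)
    )
    if not shared_focus:
        return False  # step 235: not similar
    # Steps 210 and 215: is there also a common action or description?
    return bool((a.actions & b.actions) or (a.descriptions & b.descriptions))
```

Because the three first-level tests and the two second-level tests are independent set intersections, they could equally be evaluated concurrently.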

Figure 3 is a flow chart which illustrates a method for extracting and scaling
primitive features in accordance with the present invention. The text segments are
compared for a level of commonality, including determining whether there is a
common single word (step 305), a common noun phrase (step 310), whether two
words in the phrases are synonyms (step 315), whether the phrases include verbs
having a common semantic class (step 320), and whether a common proper noun can
be found in the two phrases (step 325). If none of these conditions are satisfied for the
applied small text segments, there is no primitive feature common to these two text
segments (step 327). When a primitive feature has been identified, e.g., one of the
conditions in steps 305 through 325 is satisfied, a feature value is assigned to that
primitive feature. Preferably, the values which are assigned to the features are
determined by a machine learning algorithm, such as RIPPER, which is trained using
a suitable training corpus. RIPPER is a widely-used and effective rule induction
system which is available from AT&T Laboratories and is described by Cohen in
"Learning Trees and Rules with Set-Valued Features," Proceedings of the Fourteenth
National Conference on Artificial Intelligence, American Association on Artificial
Intelligence, 1996, which is incorporated by reference. It has been found that a subset
of a corpus of 264 paragraphs which have been manually tagged by human readers
as similar or not similar can be used to establish a feature rule set for RIPPER which
is then suitable for assigning values to the features identified in the text segments.
The particular training corpus and learned rule set will generally vary depending on
the desired application. The values assigned will vary based on properties of the
machine learning algorithm and training corpus. After feature values are assigned in
step 330, these values can be normalized based on text length (step 335) and/or noted
frequency of occurrence (step 340). Though normalization is optional, it is a desirable
step to provide uniform and accurate results across varying types of text and length of
text segments.

Primitive features provide a baseline indication of similarity. To further refine
the notion of similarity in small text segments, relationships among primitive features,
referred to as composite features, can also be evaluated. Referring to Figure 4, a
method of evaluating composite features is illustrated. Composite features are those
features which identify relationships among primitive feature pairs. Generally,
composite features are defined by placing different forms of restrictions on
participating primitive feature pairs. Referring to Figure 4, the primitive features
identified in each of the small text segments are applied to a test layer 400 where
various feature relationships are evaluated. The relationships illustrated in test layer
400 are exemplary in nature and are not intended to illustrate an exhaustive list of
possible relationships. It will be appreciated that a large number of relationships
between and among primitive features can be used to establish composite features.

WO 00/79426    PCT/US00/40238

For example, one type of feature relationship for composite features can be
that the primitives occur in the same order in each of the text samples (step 405). This
is illustrated by example in Figure 7. Figure 6 provides three short text segments to
be compared. Figure 7 illustrates a match according to the "same order" composite
feature rule. In Figures 7-9, primitive features are identified by shading and the
relationships which form the composite features are illustrated by connecting lines. In
the case illustrated in Figure 7, the primitive features {two, contact} appear in the same
order in text segments (a) and (b) of Figure 6.
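
As a non-limiting illustration, the "same order" restriction might be sketched in Python as follows. The function name `same_order` and the example segments are assumptions made for this sketch; the actual text of Figure 6 is not reproduced here, and only the first occurrence of each primitive is considered.

```python
def same_order(tokens_a, tokens_b, first, second):
    """Return True if `first` and `second` occur in the same relative order
    in both token lists (first occurrences are compared)."""
    try:
        a_first, a_second = tokens_a.index(first), tokens_a.index(second)
        b_first, b_second = tokens_b.index(first), tokens_b.index(second)
    except ValueError:
        return False  # one of the primitives is missing from a segment
    return (a_first < a_second) == (b_first < b_second)


# Hypothetical segments, tokenized on whitespace.
seg_a = "two crew members lost contact with the base".split()
seg_b = "the two pilots made contact at dawn".split()
seg_c = "contact was lost by two crews".split()

print(same_order(seg_a, seg_b, "two", "contact"))
print(same_order(seg_a, seg_c, "two", "contact"))
```

In this sketch the first pair satisfies the rule because "two" precedes "contact" in both segments, while the second pair does not because the order is reversed in the third segment.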

Another possible relationship is that two pairs of primitive elements are
required to occur within a certain distance in both text segments. The maximum
distance between the primitive elements which would satisfy the relationship can be a
variable or a predetermined constant (step 410). Referring to Figure 8, an example of a
positive match for the "within distance" composite feature rule is provided, given that
the distance, n, is set to a value less than three. In Figure 8, although the primitive
features {contact, lost} do not appear in the same order, they occur within n words of
each other (n<3 in this case).
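
The "within distance" restriction might be sketched, purely by way of example, as follows. The helper names `min_gap` and `within_distance` are hypothetical, the example segments are invented for this sketch, and distance is assumed here to mean the difference between token positions, with a match when that difference does not exceed n.

```python
def min_gap(tokens, p, q):
    """Smallest positional distance between any occurrence of p and of q,
    or None if either primitive is absent."""
    pos_p = [i for i, t in enumerate(tokens) if t == p]
    pos_q = [i for i, t in enumerate(tokens) if t == q]
    if not pos_p or not pos_q:
        return None
    return min(abs(i - j) for i in pos_p for j in pos_q)


def within_distance(tokens_a, tokens_b, p, q, n):
    """True if p and q lie within n token positions of each other in both
    segments, regardless of their order."""
    gaps = (min_gap(tokens_a, p, q), min_gap(tokens_b, p, q))
    return all(g is not None and g <= n for g in gaps)


# Hypothetical segments, tokenized on whitespace.
seg_a = "the helicopter lost contact with the base".split()
seg_b = "radio contact was lost at noon".split()

print(within_distance(seg_a, seg_b, "contact", "lost", 2))
print(within_distance(seg_a, seg_b, "contact", "lost", 1))
```

Unlike the "same order" rule, this test is symmetric in the two primitives, so a reversed order in one segment does not prevent a match.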

Yet another exemplary relationship can be that the two text segments include
the same primitive feature types. For example, one primitive feature can be restricted
to a simplex noun phrase while the other to a verb. In such a case, two noun phrases,
one from each text unit, must match according to the rule for matching simplex noun
phrases and two verbs must match according to the applied rules of verb primitives
(e.g., sharing the same semantic class). This is illustrated in Figure 9 where the
primitive feature "An OH-58 helicopter" is deemed a simplex noun phrase match with
"the helicopter" and both phrases include a common verb, "lost".
By matching primitive feature types, a simple grammatical relationship is
determined in the text segments. Returning to Figure 4, for each condition that is
satisfied in test layer 400, feature values are assigned to those composite features
identified (step 420). The feature values are assigned by a machine learning
algorithm, such as RIPPER, which has been trained on a suitable training corpus. As
in the case of primitive features, optionally, the feature values assigned to the
composite feature can be normalized for text length and relative occurrence of the
primitive feature or composite feature (steps 425, 430, respectively). Once both
primitive features and composite features of the small text segments have been
identified, a machine learning algorithm is applied to determine a similarity value
between the text segments (step 435). The machine learning algorithm can perform a
rule-based analysis to determine similarity. Alternatively, a simpler algorithm can be
used to determine similarity by comparing the total feature value of the text segments
being compared to a predetermined threshold value.
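
As a non-limiting illustration of the "primitive type" restriction discussed in connection with Figure 9, the check that a pair of segments shares both a noun phrase match and a verb match might be sketched as follows. The `PrimitiveMatch` record, the kind labels and the function name are assumptions made for this sketch; the matches themselves are assumed to have been produced by the primitive feature routines.

```python
from typing import NamedTuple


class PrimitiveMatch(NamedTuple):
    """Hypothetical record of one primitive feature matched across segments."""
    kind: str    # e.g. "noun_phrase", "verb", "word"
    text_a: str  # matching text in the first segment
    text_b: str  # matching text in the second segment


def type_rule_satisfied(matches, required_kinds=("noun_phrase", "verb")):
    """True if at least one match of every required primitive type exists."""
    found = {m.kind for m in matches}
    return all(kind in found for kind in required_kinds)


matches = [
    PrimitiveMatch("noun_phrase", "An OH-58 helicopter", "the helicopter"),
    PrimitiveMatch("verb", "lost", "lost"),
]
print(type_rule_satisfied(matches))
```

Other type combinations can be tested by passing a different `required_kinds` tuple.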

Figure 5 is a block diagram of an exemplary software system for conducting
the method described in connection with Figures 1-4. The system is generally
implemented in software for a general purpose computer, such as a personal computer
or work station. The system includes a main processing section 500. One or more
interface modules 510 are included for receiving text input for the text segments to
be compared and for providing the text segments to the main processing section 500.
The text input can be provided by a number of sources, including but not limited to,
computer readable memory, hard disks, optical disks, network databases, on-line
sources, manual keyed input and the like. Based on the desired text source and input
mechanism, one skilled in the art can provide appropriate text input interface module
510 hardware and software.

The main processing section 500 is also operatively coupled to a training
corpus 515, which is generally stored in computer readable storage media. The main
processing section 500 is generally programmed in a structured manner which calls
various subprograms, library routines, and the like to perform the various functions
described in accordance with Figures 1-4. The main processing section 500 can
invoke the various subroutines sequentially (serially) or in a parallel, or batched,
processing mode. The received text is generally passed to a preprocessing routine
520. The preprocessing routine cleans up the received text, such as by removing
control characters from the text. The preprocessing routine also performs
part-of-speech (POS) tagging, using known techniques, such as are available in the
ALEMBIC tool set, described by Aberdeen et al. in "MITRE: Description of the
Alembic System as used for MUC-6," Proceedings of the Sixth Message
Understanding Conference, 1995, which is hereby incorporated by reference.
ALEMBIC provides a set of data and language processing tools which identify the
various parts of speech present in the small text segments.

Following text preprocessing, control is returned to the main processing
section 500 which then preferably invokes a noun phrase comparison subroutine 525,
such as Linkit, to perform the noun phrase comparison of step 310. Linkit can be
employed to determine whether a common noun phrase is present in the applied text
segments and for identifying simplex noun phrases and matching those that share the
same noun head. The Linkit tool is described by N. Wacholder in "Simplex NPs
Clustered by Head: A Method for Identifying Significant Topics in a Document",
Proceedings of the Workshop on the Computational Treatment of Nominals, October
1998, which is hereby incorporated by reference in its entirety.
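
Matching simplex noun phrases by shared head might be sketched, by way of example only, as follows. This sketch is not the Linkit tool; it assumes the noun phrases have already been identified and, as a simplifying assumption, treats the last token of an English simplex noun phrase as its head. The function names are hypothetical.

```python
def np_head(noun_phrase):
    """Assumed head of a simplex noun phrase: its final token, lowercased."""
    return noun_phrase.lower().split()[-1]


def common_noun_phrases(nps_a, nps_b):
    """Pair up noun phrases from two segments that share the same head."""
    by_head = {}
    for phrase in nps_b:
        by_head.setdefault(np_head(phrase), []).append(phrase)
    return [(a, b) for a in nps_a for b in by_head.get(np_head(a), [])]


pairs = common_noun_phrases(
    ["An OH-58 helicopter", "radio contact"],
    ["the helicopter", "the base"],
)
print(pairs)
```

Under these assumptions "An OH-58 helicopter" and "the helicopter" are paired because both reduce to the head "helicopter".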

To determine if two segments include common proper nouns as required in
step 325, the noun comparison algorithm can also be used to match those nouns
identified using the ALEMBIC toolset using various predetermined matching criteria.
Variations on proper noun matching can include restricting the proper noun type to a
person, place or organization. Such subcategories can also be extracted using
ALEMBIC's named entity finder.

Following noun phrase identification and matching, other routines for
detecting primitive features can be employed. For example, to perform step 305 and
determine whether common single word primitive features exist between two text
segments, a word co-occurrence detection subroutine 540 can be called by the main
program 500. Variations of the word co-occurrence operation can restrict matching to
cases where the parts of speech of the words also match, or relax the comparison to
cases where only the word stems of the two words are identical.
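
The word co-occurrence test and its two variations might be sketched as follows, for illustration only. The input is assumed to be a list of (word, part-of-speech tag) pairs produced by a tagger; the `toy_stem` function is a deliberately crude suffix stripper standing in for a real stemmer, and all names are hypothetical.

```python
def toy_stem(word):
    """Crude illustrative stemmer: strips a few common English suffixes."""
    w = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def common_words(tagged_a, tagged_b, match_pos=False, use_stems=False):
    """Return the sorted word forms shared by two tagged segments.

    match_pos restricts matches to words with identical POS tags;
    use_stems relaxes matching to identical stems.
    """
    def key(word, pos):
        form = toy_stem(word) if use_stems else word.lower()
        return (form, pos) if match_pos else (form,)

    keys_b = {key(w, p) for w, p in tagged_b}
    return sorted({key(w, p)[0] for w, p in tagged_a if key(w, p) in keys_b})


seg_a = [("contacts", "NNS"), ("lost", "VBD"), ("report", "NN")]
seg_b = [("contact", "NN"), ("lost", "VBD"), ("report", "VB")]

print(common_words(seg_a, seg_b))
print(common_words(seg_a, seg_b, match_pos=True))
print(common_words(seg_a, seg_b, use_stems=True))
```

The three calls show the plain comparison, the stricter part-of-speech variant and the more permissive stem variant on the same pair of segments.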

Similarly, to determine if two text segments include words which are
synonyms, a synonym detection algorithm 530 can be called by the main processing
routine 500. In this regard, a lexical database such as WordNet®, as described by G.
Miller in "WordNet, An On-Line Lexical Database," International Journal of
Lexicography, Vol. 3, No. 4 (1990), can be employed. WordNet provides sense
information and places words in sets of synonyms (synsets). Words that appear in the
same synset are generally considered matches. Variations on this feature can be used
to restrict the words being compared to a specific part-of-speech class.
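
Synset-based synonym matching might be sketched as follows. To keep the example self-contained, a small invented table `TOY_SYNSETS` stands in for a lexical database such as WordNet; the table contents and the function names are assumptions of this sketch and do not reflect actual WordNet data.

```python
# Invented synonym sets used only for illustration.
TOY_SYNSETS = [
    {"helicopter", "chopper"},
    {"lose", "misplace"},
    {"contact", "touch"},
]


def synsets_of(word):
    """Indices of the toy synsets that contain the word."""
    return {i for i, synset in enumerate(TOY_SYNSETS) if word.lower() in synset}


def are_synonyms(word_a, word_b):
    """Two words match if they appear together in at least one synset."""
    return bool(synsets_of(word_a) & synsets_of(word_b))


def synonym_pairs(words_a, words_b):
    """All cross-segment pairs of distinct words that share a synset."""
    return [
        (a, b)
        for a in words_a
        for b in words_b
        if a.lower() != b.lower() and are_synonyms(a, b)
    ]


print(synonym_pairs(["helicopter", "lost"], ["chopper", "base"]))
```

A part-of-speech restriction could be added by keeping one such table per part-of-speech class and consulting only the table for the class of interest.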

To determine if two verbs present in the short text segments are of the same
semantic class as set forth in step 320, a verb classifier and comparator algorithm 535
can be operatively coupled to the main processing section 500 and called by the main
program. Semantic classes for verbs have been found to be useful for determining
document types and text similarity. This is discussed, for example, in "The Role of
Verbs in Document Analysis" by J. Klavans et al., Proceedings of the 36th Annual
Meeting of the Association for Computational Linguistics and the 17th International
Conference on Computational Linguistics, 1998, which is hereby incorporated by
reference in its entirety. For those verbs which are found to have a common semantic
class, e.g., communication, motion, agreement, argument, etc., those verbs are
considered to match.

The program operating in main processing section 500 can also provide
algorithms to normalize feature values for text length and relative occurrence of the
primitive. To normalize feature values for text length, as set forth in step 335, each
feature value can be normalized by the size of the textual segments in the pair. For
example, for a pair of textual segments A and B, the feature values assigned are
divided by a normalization value, N:

\[ N = \sqrt{\mathrm{length}(A) \times \mathrm{length}(B)} \tag{1} \]

This operation removes any potential bias in favor of longer text segments. It is noted
that the length of A and the length of B are generally measured by a word count.

Normalization of feature values can also be based on the relative frequency of
occurrence of each primitive feature. Such normalization is motivated by the general
observation that infrequently matching primitive elements are likely to have a higher
impact on similarity than primitives which match more frequently. Such
normalization is similar to the document frequency component of the commonly
employed TF*IDF calculation. In this case, each primitive feature is associated with a
value which is equal to the number of textual units in which the primitive appeared in
the corpus. For a primitive element which compares single words, this is the number
of text segments which contain that word in the corpus; for a noun phrase, this is the
number of textual units that contain noun phrases that share the same head; and
similarly for other primitive types. Each feature's value is multiplied by:

\[ \log\left(\frac{T}{N}\right) \tag{2} \]

where T is the total number of textual segments and N, in this expression, is the number
of textual segments containing the primitive. It is noted that since normalization for
text length and frequency of occurrence are both optional operations, when these two
normalization techniques are selectively applied, there are up to four variations of
normalizations for each primitive feature. Of course, other normalization techniques
may be added to, or substituted for, the two methods discussed herein.
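
The two optional normalizations might be sketched as follows. This sketch assumes that the length normalization value is the geometric mean of the two segment word counts and that the frequency factor is log(T/N), by analogy with the document frequency component of TF*IDF; both formulas and all function names are assumptions of this sketch rather than a definitive implementation.

```python
import math


def length_normalizer(len_a, len_b):
    """Assumed normalization value: geometric mean of the two word counts."""
    return math.sqrt(len_a * len_b)


def frequency_weight(total_segments, segments_with_primitive):
    """Assumed inverse-frequency factor log(T / N)."""
    return math.log(total_segments / segments_with_primitive)


def normalize(value, len_a, len_b, total_segments, segments_with_primitive,
              by_length=True, by_frequency=True):
    """Apply either, both or neither normalization to a feature value."""
    if by_length:
        value = value / length_normalizer(len_a, len_b)
    if by_frequency:
        value = value * frequency_weight(total_segments, segments_with_primitive)
    return value


# A feature value of 6.0 for segments of 4 and 9 words, where the primitive
# occurs in 10 of 100 segments (all numbers are illustrative).
print(normalize(6.0, 4, 9, 100, 10, by_frequency=False))
print(normalize(6.0, 4, 9, 100, 10))
```

The two boolean switches give the four normalization variations: none, length only, frequency only, and both.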

The program in main processing section 500 generally employs a machine
learning algorithm 545 to determine whether the text units match overall. A suitable
machine learning algorithm is RIPPER, as disclosed by Cohen in "Learning Trees and
Rules with Set-Valued Features," Proceedings of the Fourteenth National Conference
on Artificial Intelligence, American Association on Artificial Intelligence, 1996,
which is incorporated by reference. RIPPER is a widely-used and effective rule
induction system. This RIPPER algorithm is trained over a corpus of manually
marked pairs of text units contained in the training corpus 515. A suitable corpus was
constructed using a subset of the Topic Detection and Tracking (TDT) corpus
developed by NIST and DARPA. The TDT corpus is a collection of over 16,000
news articles from Reuters and CNN where many of the articles have been manually
grouped into 25 categories, each of which corresponds to a single event. The selected
corpus was formed using the Reuters articles in five of the twenty-five categories
from randomly selected days. The resulting training corpus 515 contained 30 related
articles. The 30 articles provided 264 paragraphs which were selected as the small
text segments and resulted in 10,345 comparisons between segments.

Although use of a machine learning algorithm is preferred, other algorithms
can also be used. For example, an algorithm can add the total value of composite
features found in the text segments and compare this value against a similarity
threshold. Similarly, although it is preferred to determine feature values based on the
use of a machine learning algorithm, feature values can be predetermined based on
human experience through the use of a look-up table. Alternatively, all features can
be given a binary value and the similarity comparison can be determined based on a
simple accumulated count of detected primitive and composite features.
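
The simpler alternatives to a machine learning algorithm might be sketched as follows. The feature names, the values in `LOOKUP_VALUES` and the thresholds are invented for this sketch; in practice they would be chosen from human experience or tuned for the application.

```python
# Hypothetical look-up table of predetermined feature values.
LOOKUP_VALUES = {
    "common_word": 1.0,
    "common_noun_phrase": 2.0,
    "same_order": 1.5,
}


def similar_by_threshold(features, values, threshold):
    """Sum the table values of the detected features and compare the total
    against a similarity threshold."""
    total = sum(values.get(name, 0.0) for name in features)
    return total >= threshold


def similar_by_count(features, min_count):
    """Binary-valued variant: count the distinct features detected."""
    return len(set(features)) >= min_count


detected = ["common_word", "same_order"]
print(similar_by_threshold(detected, LOOKUP_VALUES, 2.5))
print(similar_by_count(detected, 3))
```

Both variants trade the learned rule set for a single tunable parameter, which may be adequate where no tagged training corpus is available.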
The present methods, while evaluated on a corpus of English language
documents, are not language specific and are generally applicable to any language. Of
course, the individual subroutines may require some alteration to accommodate the
varied constructions found in different languages.

The methods for determining similarity in small text segments described
herein form an important component in larger systems, such as document archiving
systems and multi-document summarization systems.

Although the present invention has been described in connection with specific
exemplary embodiments, it should be understood that various changes, substitutions
and alterations can be made to the disclosed embodiments without departing from the
spirit and scope of the invention as set forth in the appended claims.
CLAIMS

1. A method for determining similarity in short text segments comprising:

determining common primitive features in the text segments;
determining common composite features in the text segments; and
calculating a similarity measure based upon said primitive and
composite features.

2. The method for determining similarity as defined by claim 1, wherein
said primitive features are selected from the group including common single word,
common noun phrase, synonyms, common semantic class of verbs, and common
proper nouns.

3. The method for determining similarity as defined by claim 1, wherein
said composite features are selected from the group including primitive feature order
restrictions, primitive distance restrictions, and primitive type restrictions.

4. The method for determining similarity as defined by claim 1, wherein
said step of determining common primitive features includes:

identifying common primitive features;
assigning a value to said primitive features; and
normalizing said value.

5. The method for determining similarity as defined by claim 4, wherein
said step of normalizing includes at least one of normalizing for text segment length
and normalizing for frequency of primitive occurrence.

6. The method for determining similarity as defined by claim 1, wherein
said step of determining common composite features includes:

identifying common primitive features;
assigning a value to said primitive features; and
normalizing said value.

7. The method for determining similarity as defined by claim 6, wherein
said step of normalizing includes at least one of normalizing for text segment length
and normalizing for frequency of primitive occurrence.

8. A system for determining similarity in short text segments comprising:

an interface circuit for receiving text segments for comparison;

a main processing section, the main processing section being
operatively coupled to the interface circuit and operating under the control of a
computer program, the program performing operations to determine common
primitive features in the text segments, determine common composite features in the
text segments, calculate a similarity measure based upon said primitive and
composite features, and provide an output indicative of the similarity measure.

9. The system for determining similarity as defined by claim 8, wherein
said primitive features are selected from the group including common single word,
common noun phrase, synonyms, common semantic class of verbs, and common
proper nouns.

10. The system for determining similarity as defined by claim 8, wherein
said composite features are selected from the group including primitive feature order
restrictions, primitive distance restrictions, and primitive type restrictions.

11. The system for determining similarity as defined by claim 8, wherein
the processing operation of determining common primitive features includes:

identifying common primitive features;
assigning a value to said primitive features; and
normalizing said value.

12. The system for determining similarity as defined by claim 11, wherein
the processing operation of normalizing includes at least one of normalizing for text
segment length and normalizing for frequency of primitive occurrence.

13. The system for determining similarity as defined by claim 8, wherein
said processing operation for determining common composite features includes:

identifying common primitive features;
assigning a value to said primitive features; and
normalizing said value.

14. The system for determining similarity as defined by claim 13, wherein
said processing operation for normalizing includes at least one of normalizing for text
segment length and normalizing for frequency of primitive occurrence.

15. The system for determining similarity as defined by claim 8, wherein
the computer program includes a noun phrase identification subroutine, a synonym
detection subroutine, a verb classifier subroutine and a word co-occurrence
subroutine.



16. The system for determining similarity as defined by claim 8, further
comprising a computer readable training corpus, and wherein the computer program
includes a machine learning algorithm operatively coupled to the training corpus for
learning and applying a rule set for determining similarity in small text segments.
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Figure 1: Input text units (from the TDT irilot- 
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the applicable national laws. 

The time limit for performing these procedural acts is 20 MONTHS from the priority date or, for those designated States 
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the filing date of the international application. (See second paragraph above.) 
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the right to file a demand for international preliminary examination. 
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of a fee (Rule 17.1(b)). 

If the priority document concerned is not submitted to the International Bureau or if the request to the receiving Office 
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document within a time limit which is reasonable under the circumstances. 

Where several priorities are claimed, the priority date to be considered for the purposes of computing the 16-month time 
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or 30 months, or later in some Offices, perform the acts referred to therein before each designated or elected Office. 

For further important information on the time limits and acts to be performed for entering the national phase, see the 
Annex to Form PCT/IB/301 (Notification of Receipt of Record Copy) and Volume II of the PCT Applicant's Guide. 





The International Bureau of WIPO 
34, chemin des Colombettes 
1211 Geneva 20, Switzerland 


Authorized officer 

J. Zahra 


Facsimile No. (41-22) 740.14.35 


Telephone No. (41-22) 338.83.38 


Form PCT/IB/308 (July 1996) 





PCT/US00/40238 



PATENT COOPERATION TREATY 

From the INTERNATIONAL BUREAU 



PCT 

NOTIFICATION OF ELECTION 
(PCT Rule 61.2) 


To: 

Commissioner 

US Department of Commerce 
United States Patent and Trademark 
Office, PCT 

2011 South Clark Place Room 

CP2/5C24 

Arlington, VA 22202 

in its capacity as elected Office 


Date of mailing (day/month/year) 

27 November 2001 (27.11.01) 




International application No. 
PCT/US00/40238 


Applicant's or agent's file reference 
32550-PCT 


International filing date (day/month/year) 
19 June 2000 (19.06.00) 


Priority date (day/month/year) 
18 June 1999 (18.06.99) 


Applicant 

KLAVANS, Judith, L. et al 



1. The designated Office is hereby notified of its election made: 

[X] in the demand filed with the International Preliminary Examining Authority on: 
03 January 2001 (03.01.01) 

[ ] in a notice effecting later election filed with the International Bureau on: 



2. The election    [X] was    [ ] was not 

made before the expiration of 19 months from the priority date or, where Rule 32 applies, within the time limit under 
Rule 32.2(b). 
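
The month-based time limits mentioned in these notifications can be checked with simple date arithmetic. The sketch below uses the priority date (18 June 1999) and the demand date (03 January 2001) stated in this file; the helper `add_months` is a hypothetical name, and the sketch assumes plain calendar-month counting, ignoring any extension for non-working days (Rule 80.5) or other adjustments.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Return the date `months` calendar months after `d`.

    The day is clamped to the last day of the target month when needed.
    """
    total = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


priority_date = date(1999, 6, 18)   # priority date stated in the forms
demand_date = date(2001, 1, 3)      # date the demand was filed

limit_19 = add_months(priority_date, 19)  # election time limit
limit_20 = add_months(priority_date, 20)  # national phase limit under Article 22
limit_30 = add_months(priority_date, 30)  # national phase limit under Article 39(1)

election_in_time = demand_date <= limit_19
```

Under these assumptions the 19-month limit falls on 18 January 2001, so a demand filed on 03 January 2001 falls inside it.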



The International Bureau of WIPO 


Authorized officer 








34, chemin des Colombettes 


Imelda REHS 




1211 Geneva 20, Switzerland 






Facsimile No.: (41-22) 740.14.35 


Telephone No.: (41-22) 338.83.38 




Form PCT/IB/331 (July 1992) 







TRANSMITTAL LETTER TO THE 
UNITED STATES RECEIVING OFFICE 



Date 


3 January 2001 


International Application No. 


PCT/US00/40238 


Attorney Docket No. 


32550-PCT 



Certification under 37 CFR 1.10 (if applicable) 



JC13 Rec'd PCT/PTO | 3 DEC 2001 



EK839852479US 



Express Mail mailing number 



3 January 2001 



Date of Deposit 



I hereby certify that the application/correspondence attached hereto is being deposited with the United States Postal Service 
"Express Mail Post Office to Addressee" service under 37 CFR 1.10 on the date indicated above and is addressed to Assistant 
Commissioner for Patents, Washington, D.C. 20231. 



Signature of person mailing correspondence 



Leroy Chick 



Typed or printed name of person mailing correspondence 



II. New International Application 




Earliest priority date 
(Day/Month/Year) 



SCREENING DISCLOSURE INFORMATION: In order to assist in screening the accompanying international 
application for purposes of determining whether a license for foreign transmittal should and could be granted and for 
other purposes, the following information is supplied. (Note: check as many boxes as apply): 

A. [ ] The invention disclosed was not made in the United States. 

B. [ ] There is no prior U.S. application relating to this invention. 

C. [ ] The following prior U.S. application(s) contain subject matter which is related to the invention disclosed in the attached 
international application. (NOTE: priority to these applications may or may not be claimed on form PCT/RO/101 
(Request) and this listing does not constitute a claim for priority.) 



application no. 




filed on 




application no. 




filed on 





' — * application(s) identified in paragraph C. 
E. [ ] The present international application [ ] contains additional subject matter not found in the prior U.S. application(s) 
identified in paragraph C. above. The additional subject matter is found on pages 



and [ ] DOES NOT ALTER [ ] MIGHT BE CONSIDERED TO ALTER the general nature of the invention in a 
manner which would require the U.S. application to have been made available for inspection by the appropriate defense 
agencies under 35 U.S.C. 181 and 37 CFR 5.1. See 37 CFR 5.15. 



III. 1^ A Response to an Invitation from the RO/US. The following document{s) is (are) enclosed: 
A. Q A Request for An Extension of Time to File a Response 
A Power of Attorney (General or Regular) 
Replacement pages: 



B. □ 

c. •□ 



pages 




of the request (PCT/RO/101) 


pages 




of the figures 


pages 




of the description 


pages 




of the abstract | 


pages 




of the claims 





D. [ ] Submission of Priority Documents 



Priority document 



Priority document 



E. [ ] Fees as specified on attached Fee Calculation sheet form PCT/RO/101 annex 



IV. LI A Request for Rectification under PCT 91 A Petition 



A Sequence Listing Diskette 



V. [X] Other (please specify): Demand for International Preliminary Examination (4 sheets), Fee Calculation Sheet, a postcard and a 
check in the amount of $627. 



The person 
signing this 
form is the: 



[ ] Applicant 


Paul D. Ackerman 


[X] Attorney/Agent (Reg. No.) 
39,891 


Typed name of signer 




[ ] Common Representative 


Signature 



PTO-1382 (Rev. 4-1995) 






U.S. Department of Commerce: Patent and Trademark Office 



The demand must be filed directly with the competent International Preliminary Examining Authority or, if two or more Authorities are 
competent, with the one chosen by the applicant. The full name or two-letter code of that Authority may be indicated by the applicant on the line 
below: 

IPEA/US 



PCT 

DEMAND 

under Article 31 of the Patent Cooperation Treaty: 
The undersigned requests that the international application specified below be the subject of 
international preliminary examination according to the Patent Cooperation Treaty and 
hereby elects all eligible States (except where otherwise indicated). 



CHAPTER II 



Identification of IPEA 


Date of receipt of DEMAND 


Box No. I IDENTIFICATION OF THE INTERNATIONAL APPLICATION 


Applicant's or agent's file reference 
32550-PCT 


International application No. 
PCT/US00/40238 


International filing date (day/month/year) 
19 June 2000 ( 19.06.00 ) 


(Earliest) Priority date (day/month/year) 
18 June 1999 ( 18.06.99 ) 



Title of invention 

SYSTEM AND METHOD FOR DETECTING TEXT SIMILARITY OVER SHORT PASSAGES 



Box No. II APPLICANT(S) 



Name and address: (Family name followed by given name; for a legal entity, full official 
designation. The address must include postal code and name of country.) 

THE TRUSTEES OF COLUMBIA UNIVERSITY IN THE CITY OF NEW YORK 
116th Street and Broadway 
New York, NY 10027 

Telephone No.: 
Facsimile No.: 
Teleprinter No.: 



State (that is, country) of nationality: 
US 



State (that is, country) of residence: 
US 



Name and address: (Family name followed by given name; for a legal entity, full official designation. The address must include postal code and 
name of country,) 



KLAVANS, JUDITH L. 
40 South Drive 

Hastings-on-Hudson, NY 10706 
US 



State (that is, country) of nationality: 
US 



State (that is, country) of residence: 
US 



Name and address: (Family name followed by given name; for a legal entity, full official designation. The address must include postal code and 
name of country.) 



ESKIN, ELEAZAR 
935 Stanford Street 
Santa Monica, CA 90403 
US 



State (that is, country) of nationality: 
US 



State (that is, country) of residence: 
US 



[X] Further applicants are indicated on a continuation sheet. 



Form PCT/IPEA/401 (first sheet) (July 1998; reprint July 2000) 



LegalStar 2000, Form PCTDEM 



See Notes to the demand form 





International application No. 


Sheet No. .?. 


PCT/US00/40238 



Continuation of Box No. II APPLICANT(S) 



If none of the following sub-boxes is used, this sheet is not to be included in the demand. 



Name and address: (Family name followed by given name; for a legal entity, full official designation. The address must include postal code and 
name of country.) 

HATZIVASSILOGLOU, VASILEIOS 
452 Riverside Drive, Apt. 41 
New York, NY 10027 
US 



State (that is, country) of nationality: 
GR 



State (that is, country) of residence: 
US 






[ ] Further applicants are indicated on another continuation sheet. 



Form PCT/IPEA/401 (continuation sheet) (July 1998; reprint July 2000) 



LegalStar 2000, Form PCTDEM    See Notes to the demand form 



Sheet No. 


International application No. 

HATZIVASSILOGLOU. 


Box No. III AGENT OR COMMON REPRESENTATIVE; OR ADDRESS FOR CORRESPONDENCE 


The following person is    [X] agent    [ ] common representative 

and [X] has been appointed earlier and represents the applicant(s) also for international preliminary examination. 

[ ] is hereby appointed and any earlier appointment of (an) agent(s)/common representative is hereby revoked. 

[ ] is hereby appointed, specifically for the procedure before the International Preliminary Examining Authority, in 
addition to the agent(s)/common representative appointed earlier. 


Name and address: (Family name followed by given name; for a legal entity, full official designation. 
The address must include postal code and name of country.) 

TANG, HENRY and 
ACKERMAN, PAUL D. 
Baker Botts LLP 
30 Rockefeller Plaza 
New York, NY 10112 
US 


Telephone No.: 
(212) 705-5000 


Facsimile No.: 
(212) 705-5020 


Teleprinter No.: 



[ ] Address for correspondence: Mark this check-box where no agent or common representative is/has been appointed and 
the space above is used instead to indicate a special address to which correspondence should be sent. 



Box No. IV BASIS FOR INTERNATIONAL PRELIMINARY EXAMINATION 



Statement concerning amendments:* 

1. The applicant wishes the international preliminary examination to start on the basis of: 
[X] the international application as originally filed. 

    the description    [ ] as originally filed 
                       [ ] as amended under Article 34 

    the claims             as originally filed 
                       [ ] as amended under Article 19 (together with any accompanying statement) 
                       [ ] as amended under Article 34 

    the drawings           as originally filed 
                       [ ] as amended under Article 34 

2. [ ] The applicant wishes any amendment to the claims under Article 19 to be considered as reversed. 

3. [ ] The applicant wishes the start of the international preliminary examination to be postponed until the expiration of 
20 months from the priority date unless the International Preliminary Examining Authority receives a copy of any 
amendments made under Article 19 or a notice from the applicant that he does not wish to make such amendments 
(Rule 69.1(d)). (This check-box may be marked only where the time limit under Article 19 has not yet expired.) 

* Where no check-box is marked, international preliminary examination will start on the basis of the international application as 
originally filed or, where a copy of amendments to the claims under Article 19 and/or amendments of the international 
application under Article 34 are received by the International Preliminary Examining Authority before it has begun to draw up 
a written opinion or the international preliminary examination report, as so amended. 



Language for the purposes of international preliminary examination: English 
[X] which is the language in which the international application was filed. 
[ ] which is the language of a translation furnished for the purposes of international search. 
[ ] which is the language of publication of the international application. 
[ ] which is the language of the translation (to be) furnished for the purposes of international preliminary examination. 



Box No. V ELECTION OF STATES 



The applicant hereby elects all eligible States (that is, all States which have been designated and which are bound by Chapter II of the 
PCT) 

excluding the following States which the applicant wishes not to elect: 



Form PCT/IPEA/401 (second sheet) (July 1998; reprint July 2000)    LegalStar 2000, Form PCTDEM    See Notes to the demand form 



















International application No. 




o 

Sheet No. .Y. 




HATZIVASSILOGLOU. 


Box No. VI CHECK LIST 


The demand is accompanied by the following elements, in the language referred to in 
Box No. IV, for the purposes of international preliminary examination: 


                                                          For International Preliminary 
                                                          Examining Authority use only 
                                                          received    not received 

1. translation of international application      sheets     [ ]          [ ] 
2. amendments under Article 34                   sheets     [ ]          [ ] 
3. copy (or, where required, translation) 
   of amendments under Article 19                sheets     [ ]          [ ] 
4. copy (or, where required, translation) 
   of statement under Article 19                 sheets     [ ]          [ ] 
5. letter                                        sheets     [ ]          [ ] 
6. other (specify)                               sheets     [ ]          [ ] 



The demand is also accompanied by the item(s) marked below: 

1. fee calculation sheet 
2. [ ] separate signed power of attorney 
3. [ ] copy of general power of attorney; reference number, if any: 
4. [ ] statement explaining lack of signature 
5. [ ] nucleotide and/or amino acid sequence listing in computer readable form 
6. other (specify): Transmittal Letter 



Box No. VII SIGNATURE OF APPLICANT, AGENT OR COMMON REPRESENTATIVE 



Next to each signature, indicate the name of the person signing and the capacity in which the person signs (if such capacity is not 
obvious from reading the demand). 



Paul D. Ackerman (Agent) 

For International Preliminary Examining Authority use only 

1 . Date of actual receipt of DEMAND: 



2. Adjusted date of receipt of demand due 
to CORRECTIONS under Rule 60.1(b): 



3. [ ] The date of receipt of the demand is AFTER the expiration of 19 months 
from the priority date and item 4 or 5, below, does not apply. 
[ ] The applicant has been informed accordingly. 

4. [ ] The date of receipt of the demand is WITHIN the period of 19 months from the priority date as extended by virtue of 
Rule 80.5. 

5. [ ] Although the date of receipt of the demand is after the expiration of 19 months from the priority date, the delay in arrival is 
EXCUSED pursuant to Rule 82. 



For International Bureau use only 



Demand received from IPEA on: 



Form PCT/IPEA/401 (last sheet) (July 1998; reprint July 2000) 



LegalStar 2000, Form PCTDEM    See Notes to the demand form 



PCT 



CHAPTER II 



FEE CALCULATION SHEET 
Annex to the Demand for international preliminary examination 

- For International Preliminary Examining Authority use only 



International 
application No. 



PCT/US00/40238 



Applicant's or agent's 
file reference 



32550-PCT 



Date stamp of the IPEA 



Applicant 

THE TRUSTEES OF COLUMBIA UNIVERSITY IN THE CITY OF NEW YORK 



Calculation of prescribed fees 

1. Preliminary examination fee                                   490.00   P 

2. Handling fee (Applicants from certain States are 
   entitled to a reduction of 75% of the handling fee. 
   Where the applicant is (or all applicants are) so 
   entitled, the amount to be entered at H is 25% of the 
   handling fee.)                                                137.00   H 

3. Total of prescribed fees 
   Add the amounts entered at P and H 
   and enter total in the TOTAL box                              627.00   TOTAL 
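
The total on this sheet follows directly from the two entries, and the arithmetic can be reproduced as a short sketch. The function name `total_prescribed_fees` and its `reduced` flag are hypothetical; the amounts are the ones entered at P and H above, and the 25% factor is the reduced handling fee described in item 2.

```python
from decimal import Decimal


def total_prescribed_fees(preliminary_exam_fee, handling_fee, reduced=False):
    """Add the amounts entered at P and H.

    When `reduced` is True, only 25% of the handling fee is charged,
    as described for applicants entitled to the 75% reduction.
    """
    h = handling_fee * Decimal("0.25") if reduced else handling_fee
    return preliminary_exam_fee + h


P = Decimal("490.00")  # preliminary examination fee from the sheet
H = Decimal("137.00")  # handling fee from the sheet

total = total_prescribed_fees(P, H)
reduced_total = total_prescribed_fees(P, H, reduced=True)
```

With the unreduced handling fee the result equals the 627.00 entered in the TOTAL box, which also matches the check amount of $627 listed in the transmittal letter.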



Mode of Payment 

[ ] authorization to charge deposit account with the IPEA (see below) 
[X] cheque 
[ ] postal money order 
[ ] bank draft 
[ ] cash 
[ ] revenue stamps 
[ ] coupons 
[ ] other (specify): 



Deposit Account Authorization (this mode of payment may not be available at all IPEAs) 

The IPEA/US [ ] is hereby authorized to charge the total fees indicated above to my deposit account. 

(this check-box may be marked only if the conditions for deposit accounts of the IPEA so permit) is 
hereby authorized to charge any deficiency or credit any overpayment in the total fees indicated 
above to my deposit account. 

Deposit Account Number: 02-4377 

Date (day/month/year): 3 January 2000 



Signature 



Form PCT/IPEA/401 (Annex) (July 1998; reprint July 2000) 



LegalStar 2000. Form PCTDFEE 



See Notes to the fee calculation sheet 



PATENT COOPERATION TREATY 



From the 

INTERNATIONAL PRELIMINARY EXAMINING AUTHORITY 



To: 

HENRY TANG 

BAKER BOTTS LLP 

30 ROCKEFELLER PLAZA 

NEW YORK, NY 10112-0228 



PCT 




NOTIFICATION OF RECEIPT 
OF DEMAND BY COMPETENT INTERNATIONAL 
PRELIMINARY EXAMINING AUTHORITY 

(PCT Rules 59.3(e) and 61.1(b), first sentence 
and Administrative Instructions, Section 601(a)) 



Date of mailing 
(day/month/year) 



25 SEP 2001 



Applicant's or agent's file reference 

32550-PCT 



IMPORTANT NOTIFICATION 



International application No. 

PCT/US00/40238 



International filing date (day/month/year) 
19 JUN 00 



Priority date (day/month/year) 
18 JUN 99 



Applicant 

THE TRUSTEES OF COLUMBIA UNIVERSITY IN THE CITY OF 



1. The applicant is hereby notified that this International Preliminary Examining Authority considers the following date as the date 
of receipt of the demand for international preliminary examination of the international application: 



2. That date of receipt is: 

[X] the actual date of receipt of the demand by this Authority (Rule 61.1(b)). 

[ ] the actual date of receipt of the demand on behalf of this Authority (Rule 59.3(e)). 

[ ] the date on which this Authority has, in response to the invitation to correct defects in the demand (Form 
PCT/IPEA/404), received the required corrections. 



3. [ ] ATTENTION: That date of receipt is AFTER the expiration of 19 months from the priority date. Consequently, the 
election(s) made in the demand does (do) not have the effect of postponing the entry into the national phase until 30 months 
from the priority date (or later in some Offices) (Article 39(1)). Therefore, the acts for entry into the national phase must 
be performed within 20 months from the priority date (or later in some Offices) (Article 22). For details, see the PCT 
Applicant's Guide, Volume II. 

[ ] (If applicable) This notification confirms the information given by telephone, facsimile transmission or in person on: 



4. Only where paragraph 3 applies, a copy of this notification has been sent to the International Bureau. 






Name and mailing address of the IPEA/ 
Assistant Commissioner for Patents 
Box PCT 

Washington, D.C. 20231 Attn: RO/US 
Facsimile No. 703-305-3230 



Authorized officer 

Telephone No. 703 



Form PCT/IPEA/402 (July 1998) 




PATENT COOPERATION TREATY 

PCT 

INTERNATIONAL PRELIMINARY EXAMINATION REPORT 
(PCT Article 36 and Rule 70) 




Applicant's or agent's file reference 
32550-PCT 



International application No. 
PCT/US00/40238 



FOR FURTHER ACTION 



See Notification of Transmittal of International 
Preliminary Examination Report (Form PCT/IPEA/416) 



International filing date (day/month/year) 
19 June 2000 (19.06.2000) 



Priority date (day/month/year) 
18 June 1999 (18.06.1999) 



International Patent Classification (IPC) or national classification and IPC 



RECEIVED 



IPC(7): G06F 17/21 and US Cl.: 704/10, 1, 9; 707/6, 531, 532 



Applicant 

THE TRUSTEES OF COLUMBIA UNIVERSITY IN THE CITY OF 



FEB 03 2003 



Technology Center 2600 



1. This international preliminary examination report has been prepared by this International Preliminary 
Examining Authority and is transmitted to the applicant according to Article 36. 

2. This REPORT consists of a total of 3 sheets, including this cover sheet. 

[ ] This report is also accompanied by ANNEXES, i.e., sheets of the description, claims and/or drawings 
which have been amended and are the basis for this report and/or sheets containing rectifications made 
before this Authority (see Rule 70.16 and Section 607 of the Administrative Instructions under the PCT). 

These annexes consist of a total of 0 sheets. 



This report contains indications relating to the following items: 

I     [X]  Basis of the report 
II    [ ] 
III   [ ] 
IV    [ ] 
V     [X]  Reasoned statement under Article 35(2) with regard to novelty, inventive step or industrial 
           applicability; citations and explanations supporting such statement 
VI    [ ] 
VII   [ ] 
VIII  [ ] 



Date of submission of the demand 
03 January 2001 (03.01.2001) 



Name and mailing address of the IPEA/US 
Commissioner of Patents and Trademarks 
Box PCT 

Washington, D.C. 20231 
Facsimile No. (703)305-3230 



Date of completion of this report 
08 September 2002 (08.09.2002) 




Authorized officer 
Mirsha D. Banks-Harold 
Telephone No. (703) 305-3900 



Form PCT/IPEA/409 (cover sheet) (July 1998) 








INTERNATIONAL PRELIMINARY EXAMINATION REPORT 


International application No. 
PCT/US00/40238 




I. Basis of the report 




1. With regard to the elements of the international application:* 

    the international application as originally filed. 

[X] the description: 
        pages 1-12, as originally filed 
        pages NONE, filed with the demand 
        pages NONE, filed with the letter of 

    the claims: 
        pages 13-15, as originally filed 
        pages NONE, as amended (together with any statement) under Article 19 
        pages NONE, filed with the demand 
        pages NONE, filed with the letter of 

[X] the drawings: 
        pages 1-8, as originally filed 
        pages NONE, filed with the demand 
        pages NONE, filed with the letter of 

[ ] the sequence listing part of the description: 
        pages NONE, as originally filed 
        pages NONE, filed with the demand 
        pages NONE, filed with the letter of 

2. With regard to the language, all the elements marked above were available or furnished to this Authority in the 
language in which the international application was filed, unless otherwise indicated under this item. 

These elements were available or furnished to this Authority in the following language which is: 

[ ] the language of a translation furnished for the purposes of international search (under Rule 23.1(b)). 
[ ] the language of publication of the international application (under Rule 48.3(b)). 
[ ] the language of the translation furnished for the purposes of international preliminary examination (under Rules 
    55.2 and/or 55.3). 

3. With regard to any nucleotide and/or amino acid sequence disclosed in the international application, the 
international preliminary examination was carried out on the basis of the sequence listing: 

[ ] contained in the international application in printed form. 
[ ] filed together with the international application in computer readable form. 
[ ] furnished subsequently to this Authority in written form. 
[ ] furnished subsequently to this Authority in computer readable form. 
[ ] The statement that the subsequently furnished written sequence listing does not go beyond the disclosure in the 
    international application as filed has been furnished. 
[ ] The statement that the information recorded in computer readable form is identical to the written sequence listing 
    has been furnished. 

4. [X] The amendments have resulted in the cancellation of: 

    [X] the description, pages NONE 
    [X] the claims, Nos. NONE 
    [X] the drawings, sheets/fig NONE 

5. This report has been established as if (some of) the amendments had not been made, since they have been considered to go 
beyond the disclosure as filed, as indicated in the Supplemental Box (Rule 70.2(c)).** 

* Replacement sheets which have been furnished to the receiving Office in response to an invitation under Article 14 are referred to in 
this report as "originally filed" and are not annexed to this report since they do not contain amendments (Rules 70.16 and 70.17). 
** Any replacement sheet containing such amendments must be referred to under item 1 and annexed to this report. 





Form PCT/IPEA/409 (Box I) (July 1998) 




INTERNATIONAL PRELIMINARY EXAMINATION REPORT 



International application No. 
PCT/US00/40238 





V. Reasoned statement under Rule 66.2(a)(ii) with regard to novelty, inventive step or industrial applicability; 
citations and explanations supporting such statement 




1. STATEMENT 

Novelty (N)            Claims 4-7 and 11-14            YES 
                       Claims 1-3, 8-10 and 15-16      NO 

Inventive Step (IS)    Claims NONE                     YES 
                       Claims 1-16                     NO 

                       Claims NONE                     NO 




2. CITATIONS AND EXPLANATIONS 

I. Claims 1-3, 8-10, and 15-16 lack novelty under PCT Article 33(2) as being anticipated by Kupiec (US 5,696,962 A). 

(A) As per claim 1, Kupiec teaches a method for computerized information retrieval using shallow linguistic analysis, and for 
determining similarity in text segments (Kupiec; col. 33, lines 12-23), comprising the steps of: 

determining features such as part-of-speech information, noun phrase, verbs, synonyms, and hyponyms (reads on "primitive 
features") (Kupiec; col. 7, line 63 to col. 10, line 8, and col. 12, lines 51-65); 

determining features such as proximity, order, and constraints (reads on "composite features") (Kupiec; col. 6, lines 45-57; and col. 
33, line 25 to col. 34, line 55); and 

matching like phrases by calculating similarity measures therefrom (Kupiec; col. 3, lines 44-45; col. 32, line 64 to col. 33, line 23; 
and col. 35, line 5 to col. 36, line 43). 

(B) As per claim 2, note col. 7, line 63 to col. 10, line 8 and col. 12, lines 51-65 of Kupiec. 
(C) As per claim 3, note col. 6, lines 45-57 and col. 33, line 25 to col. 34, line 55 of Kupiec. 

(D) Claims 8-10 differ from claims 1-3 by reciting system elements such as an interface circuit and a main processing section 
operating under the control of a computer program. As per these limitations, the Kupiec system has a user interface (7) and runs on a 
programmed CPU (5) and memory (6) (Kupiec; fig. 1 and col. 5, lines 42-58). The remaining limitations of claims 8-10 are as 
addressed above in the discussion of claims 1-3, and incorporated herein. 

(E) As per claims 15-16, Kupiec discloses the use of part-of-speech taggers and phrase recognizers, as well as the training of text via 
Hidden Markov Model (HMM) estimations (reads on "machine learning algorithm") (Kupiec; col. 8, line 46 to col. 10, line 8; and 
col. 39, line 64 to col. 40, line 10). 

II. Claims 4-7 and 11-14 lack an inventive step under PCT Article 33(3) as being obvious over Kupiec (US 5,696,962 A) in view of 
Schuetze (US 5,675,819 A). 

(A) As per claims 4-5, Kupiec discloses the use of different ranking or prioritization criteria based on the frequency of some words 
within retrieved documents or the text corpus as a whole (Kupiec; col. 26, lines 22-28), but fails to expressly teach the normalizing 
of primitive features having assigned values and according to text segment length or frequency of word occurrence. However, this is 
known in the art, as evidenced by Schuetze. 

In particular, Schuetze discloses computing context vectors for words (reads on "assigning values"), and then normalizing the context 
vectors (Schuetze; col. 17, line 56 to col. 18, line 10 and fig. 10). 

One having ordinary skill in the art at the time of the invention would have found it obvious to assign values to the query features 
(such as part-of-speech information, noun phrase, verbs, synonyms, and hyponyms, which read on "primitive features") and to 
normalize these values with the motivation of improving retrieval performance for non-literal matches with queries (Schuetze; col. 
4, lines 13-15). 

(B) Claims 6-7, 11-12, and 13-14 repeat the same limitations of claims 4-5 and are therefore obvious for the same reasons given above 
for claims 4-5. 

NEW CITATIONS 

US 5,696,962 A (KUPIEC) 09 December 1997, see abstract; fig. 1; col. 3, lines 44-45; col. 5, lines 42-58; col. 6, lines 45-57; col. 7, 
line 63 to col. 10, line 8; col. 12, lines 51-65; col. 26, lines 22-28; col. 32, line 64 to col. 34, line 55; col. 35, line 5 to col. 36, line 
43; and col. 39, line 64 to col. 40, line 10. 
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